ABSTRACT

Network topologies can have a significant effect on the costs of
algorithms due to inter-processor communication. Parallel
algorithms that ignore network topology can suffer from contention
along network links. However, for particular combinations
of computations and network topologies, costly network
contention may inevitably become a bottleneck, even
for optimally designed algorithms. We obtain a novel contention
lower bound that is a function of the network and
the computation graph parameters. To this end, we compare
the communication bandwidth needs of subsets of processors
and the available network capacity (as opposed to
per-processor analysis in most previous studies). Applying
this analysis, we improve communication cost lower bounds
for several combinations of fundamental computations on
common network topologies.


Categories  and  Subject  Descriptors 

F.2.1 [Analysis of Algorithms and Problem Complexity]:
Numerical Algorithms and Problems - Computations
on matrices

General  Terms 

Algorithms,  Design,  Performance. 

Keywords 

Network topology, Communication-avoiding algorithms, Strong
scaling, Communication costs.


*Current affiliation: Google Inc.


1.  INTRODUCTION 

Good connectivity of the inter-processor network is necessary
for fast execution of parallel algorithms. Insufficient
graph expansion of the network provably slows down specific
parallel algorithms that are communication intensive. While
parallel algorithms that ignore network topology can suffer
from contention along network links, for particular combinations
of computations and network topologies, costly
network contention may be inevitable, even for optimally
designed algorithms. In this paper we obtain novel lower
bounds on such contention cost, and point to cases where
this cost is a performance bottleneck.

We use a variant of the distributed-memory communication
model (cf. [16, 19, 11]), where the bandwidth cost of an
algorithm is proportional to the number of words communicated
by one processor (we omit discussion of the latency cost,
i.e., the message count, from this work). As in the
distributed-memory communication model, we have P processors and a
local memory of size M for each processor. However, here
we do not assume all-to-all connectivity, but rather some
network graph $G_{Net}$ with P vertices. In this work we assume
all edges (network links) have the same bandwidth, and the
nodes of the network are both processors and routers (i.e., a
direct network, where no node is solely a router). We ignore
processor injection rates in this model.

Most previous communication cost lower bounds for parallel
algorithms utilize per-processor analysis. That is, the
lower bounds establish that some processor must communicate
a given amount of data. These include classical matrix
multiply, direct and iterative linear algebra algorithms,
FFT, Strassen and Strassen-like fast algorithms, graph-related
algorithms, N-body, sorting, and others (cf. [3, 25, 23,
32, 26, 11, 9, 15, 22, 8, 28, 35, 21, 34]).

By considering the network graphs, we introduce communication
lower bounds for certain computations and networks
that are tighter than what was previously known. We
bound from below the number of words communicated between
a subset of processors and the rest of the processors for
a given parallel algorithm (defined by a computation graph
and work assignment to the processors), and divide it by the
number of words that the network is capable of communicating
simultaneously between that subset of processors and
the rest of the graph. This relates to the contention cost of
the algorithm, which we specify in Definition 2.2. Applying
the main theorem, we improve (i.e., increase) communication
cost lower bounds for several combinations of fundamental
computations on common network topologies. Note that we
inherit any assumptions made in the original per-processor
lower bounds, e.g., no recomputation. These contention
bounds may suggest directions for hardware/network design
tailored for heavily used computation kernels and may assist
when scheduling users' applications on (a subset of) a
supercomputer.

2.  CONTENTION  LOWER  BOUND 

In  this  section  we  state  our  main  result,  which  translates 
per-processor  bandwidth  cost  lower  bounds  to  contention 
cost  lower  bounds.  The  following  definitions  differentiate 
these  costs. 

Definition 2.1. Let a parallel algorithm be run on a parallel
distributed-memory machine with P processors. The
per-processor bandwidth cost $W_{proc}$ is the maximum over
processors $1 \le p \le P$ of the number of words sent or
received by processor p.

Observe that for $W_{proc}$ we can plug in two types of
per-processor lower bounds: memory-independent $W_{proc}(P, N)$
(cf. [9]) and memory-dependent $W_{proc}(P, M, N)$ (cf. [26,
11, 12, 10, 21]), where N is the input and output data size.

Definition 2.2. Let a parallel algorithm be run on a parallel
distributed-memory machine with network graph $G_{Net} = (V, E)$,
where V and E are the sets of nodes and network links
in $G_{Net}$, respectively. The contention cost $W_{link}$ is the
maximum over edges $e \in E$ of the number of words communicated
along e during the execution of the algorithm.

In order to prove our result, we will use graph expansion
analysis. Recall that the small set expansion $h_s(G)$ of
a d-regular graph $G = (V, E)$ is the minimum normalized
number of edges leaving a set of vertices of size at most s.
Formally, for $s \le |V|/2$, we have

$$h_s(G) = \min_{S \subseteq V,\ |S| \le s} \frac{|E(S, V \setminus S)|}{|E(S)|}$$

where $E(S)$ is the set of edges that have at least one endpoint
in vertex subset S and $E(S, V \setminus S)$ is the set of edges with only
one endpoint in S. The cardinality of a set S is represented
by $|S|$. In the case of d-regular graphs, $|E(S)| \le d|S|$.
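
As a sanity check on the definition above, the following is a minimal brute-force sketch in Python. The function names (`ring_edges`, `small_set_expansion`) are illustrative and not from the paper, and the enumeration is exponential in s, so it is only meant for tiny graphs such as an 8-node ring (a 1-dimensional torus). It assumes every vertex has at least one incident edge.

```python
from fractions import Fraction
from itertools import combinations


def ring_edges(p):
    """Edges of a ring (1-dimensional torus) on vertices 0..p-1."""
    return [(v, (v + 1) % p) for v in range(p)]


def small_set_expansion(num_vertices, edges, s):
    """Brute-force h_s(G): minimum over nonempty S with |S| <= s of
    |E(S, V \\ S)| / |E(S)|, returned as an exact fraction."""
    best = None
    for size in range(1, s + 1):
        for subset in combinations(range(num_vertices), size):
            members = set(subset)
            touching = 0  # edges with at least one endpoint in S, i.e. |E(S)|
            cut = 0       # edges with exactly one endpoint in S
            for u, v in edges:
                inside = (u in members) + (v in members)
                if inside >= 1:
                    touching += 1
                if inside == 1:
                    cut += 1
            ratio = Fraction(cut, touching)
            if best is None or ratio < best:
                best = ratio
    return best


# On an 8-node ring the minimizer for s = 4 is a contiguous arc of 4 nodes:
# 2 cut edges out of 5 edges touching the arc.
h4 = small_set_expansion(8, ring_edges(8), 4)
```

For the ring, a contiguous arc of t nodes has 2 cut edges and t + 1 touching edges, which is why the value decreases as s grows.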


Theorem 2.3. Consider a distributed-memory machine
with P processors, each with local memory of size M, and an
inter-processor network graph $G_{Net}$. Given a computation
with input and output data size N, and a lower bound on the
memory-dependent per-processor bandwidth cost $W_{proc}(P, M, N)$,
for all algorithms that distribute the workload so that every
processor performs $\Theta(1/P)$ of the computation, and that
distribute the input and output data such that every processor
stores $\Theta(1/P)$ of the data, the memory-dependent contention
cost $W_{link}(P, M, N)$ is bounded below by

$$W_{link}(P, M, N) \ge \max_{t \in T} \frac{W_{proc}(P/t,\ M \cdot t,\ N)}{d \cdot t \cdot h_t(G_{Net})}$$

where

$$T = \{\, t : 1 \le t \le P/2,\ \exists S \subseteq V \text{ s.t. } |S| = t \text{ and } h_t(G_{Net}) = |E(S, V \setminus S)| / |E(S)| \,\}.$$


Proof. Consider a partitioning of the P processors into
P/t subsets of size $t \in T$ (w.l.o.g., P is divisible by t), where
at least one of the subsets $S_t$ is connected to the rest of the
network graph with at most $d \cdot t \cdot h_t(G_{Net})$ edges (see
Footnote 1). The existence of such a set $S_t$ is guaranteed by the
definition of $h_s(G_{Net})$ and T. Then $S_t$ has a total of $M \cdot t$
local memory. By the workload distribution assumption, the
processors in $S_t$ perform a fraction $\Theta(t/P)$ of the flops, and by the
data distribution assumption, $S_t$ has local access to a fraction
$\Theta(t/P)$ of the input/output. Hence we can emulate
this computation by a parallel machine with P/t processors,
each with $M \cdot t$ local memory (see Figure 1), and apply the
corresponding per-processor lower bound, deducing that the
processors in $S_t$ must send to or receive from the processors
outside $S_t$ at least $W_{proc}(P/t, M \cdot t, N)$ words throughout
the running of the algorithm. At most $d \cdot t \cdot h_t(G_{Net})$ edges
connect $S_t$ to the rest of the graph. Hence at least one edge
communicates at least $W_{proc}(P/t, M \cdot t, N) / (d \cdot t \cdot h_t(G_{Net}))$ words.
Since t is a free parameter, we can pick it to maximize this lower
bound on $W_{link}(P, M, N)$, and the theorem follows. □

Note that the memory-independent contention lower bound,
$W_{link} = W_{link}(P, N)$, follows in the same way from a
memory-independent per-processor bound $W_{proc}(P, N)$.
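
To make the use of Theorem 2.3 concrete, here is a hedged Python sketch that evaluates the right-hand side of the theorem over a list of candidate subset sizes t. All names are hypothetical, constant factors are dropped, the small set expansion is modelled as $t^{-1/D}$ (the torus behaviour derived later in Lemma 3.7), and the per-processor bound used in the example is the classical matrix multiplication bound $n^3/(P\sqrt{M})$ with $N = n^2$.

```python
def contention_lower_bound(w_proc, P, M, N, d, h, candidates):
    """Evaluate max over t of w_proc(P/t, M*t, N) / (d * t * h(t)).

    w_proc     -- per-processor lower bound, called as w_proc(P, M, N)
    d          -- degree of the network graph
    h          -- function t -> small set expansion h_t of the network
    candidates -- subset sizes t to try (assumed to lie in the set T)
    Returns (best_t, best_value).
    """
    best_t, best_value = None, float("-inf")
    for t in candidates:
        value = w_proc(P / t, M * t, N) / (d * t * h(t))
        if value > best_value:
            best_t, best_value = t, value
    return best_t, best_value


def matmul_w_proc(P, M, N):
    """Classical matmul memory-dependent bound n^3 / (P sqrt(M)), constants dropped."""
    n = N ** 0.5
    return n ** 3 / (P * M ** 0.5)


def torus_h(D):
    """Model h_t of a D-dimensional torus as t^(-1/D), constants dropped."""
    return lambda t: t ** (-1.0 / D)


sizes = [1, 2, 4, 8, 16, 32]  # powers of two up to P/2 for P = 64
ring_t, ring_bound = contention_lower_bound(
    matmul_w_proc, P=64, M=1024, N=2 ** 20, d=2, h=torus_h(1), candidates=sizes)
cube_t, cube_bound = contention_lower_bound(
    matmul_w_proc, P=64, M=1024, N=2 ** 20, d=6, h=torus_h(3), candidates=sizes)
```

Under these modelling assumptions the maximum on a ring is attained at the largest subset (t = P/2), while on a 3D torus it is attained at t = 1, which reproduces the per-processor bound up to a constant.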


Figure 1: Computation of t = 4 processors on a
16-processor machine can be emulated as the computation
of one processor on a 4-processor machine.


3.  PRELIMINARIES 

3.1  Per-Processor  Lower  Bounds 

Before  deriving  bounds  on  link  contention,  we  review  the 
per-processor  communication  bounds  for  several  classes  of 
algorithms. 

Classical  Linear  Algebra. 

Most classical direct linear algebra computations can be
specified by three nested loops, and for dense $n \times n$ matrices,
the number of flops performed is $\Theta(n^3)$ (see Footnote 2). Informally, such
computations, which include matrix multiplication, Cholesky
and LU decompositions, and many others, can be defined by

$$C_{ij} = f_{ij}\left(\{ g_{ijk}(A_{ik}, B_{kj}) \}_{1 \le k \le n}\right) \quad \text{for } 1 \le i, j \le n \qquad (1)$$

where f and g are sets of functions particular to the computation.
For example, in the case of classical matrix multiplication,
$f_{ij}$ is a summation and $g_{ijk}$ is a scalar multiplication
for all i, j, k. For a more formal definition, see [7, Definition
4.1].

Footnote 1: Note that $S_t$ is connected to the rest of the network graph
with exactly $d \cdot t \cdot h_t(G_{Net})$ edges only when $|E(S)| = d|S|$.

Footnote 2: For matrix computations, we denote the size of the
input/output to be $N = \Theta(n^2)$.

For such computations, we have the following lower
bound:

Theorem 3.1 ([11], [26]). Consider an algorithm performing
a computation of the form given by equation (1)
on P processors, each with local memory of size M, and assume
one copy of the input data is initially distributed across
processors and the computation is load balanced. Then the
number of words some processor must communicate is at
least

$$W_{proc}(P, M, N) = \Omega\left(\frac{n^3}{P \sqrt{M}}\right).$$

Note that the local memory size M appears in the denominator
of the expression above, which is why we refer to it
as the memory-dependent bound. Additionally, such computations
also inherit a memory-independent lower bound:

Theorem 3.2 ([9]). Consider an algorithm performing
a computation of the form given by equation (1) on P processors,
and assume just one copy of the input data is initially
distributed across processors and the computation is
load balanced. Then the number of words some processor
must communicate is at least

$$W_{proc}(P, N) = \Omega\left(\frac{n^2}{P^{2/3}}\right).$$

Strassen-like  Matrix  Multiplication. 

Similar lower bounds exist for Strassen's matrix multiplication
and similar algorithms, though the proof techniques
differ substantially. Informally, we use the term "Strassen-like"
to refer to algorithms that recursively multiply matrices
according to a base-case computation. For square algorithms,
this corresponds to multiplying $n_0 \times n_0$ matrices
with $m_0$ scalar multiplications, where $n_0$ and $m_0$ are constants.
Using recursion, this results in a square matrix multiplication
flop count of $\Theta(n^{\omega_0})$ where $\omega_0 = \log_{n_0} m_0$. Note
that additional technical assumptions are required for the
communication lower bounds to apply and that Strassen-like
algorithms may have a rectangular base case; see [12,
Section 5.1] for more details. The memory-dependent communication
lower bound for Strassen-like algorithms is:

Theorem 3.3 ([12, Corollary 1.5]). Consider a
Strassen-like matrix multiplication algorithm that requires
$\Theta(n^{\omega_0})$ total flops. Suppose a parallel algorithm performs
the computation using P processors (each with local memory
of size M), load balances the flops, and performs no
redundant computation. Then the number of words some
processor must communicate is at least

$$W_{proc}(P, M, N) = \Omega\left(\frac{n^{\omega_0}}{P \cdot M^{\omega_0/2 - 1}}\right).$$

Additionally, such computations also inherit a memory-independent
lower bound:

Theorem 3.4. Suppose a parallel algorithm performs a
Strassen-like matrix multiplication algorithm requiring $\Theta(n^{\omega_0})$
flops, load balances the computation across P processors,
and performs no redundant computation. Then under some
technical assumptions (see [12]) the number of words some
processor must communicate is at least

$$W_{proc}(P, N) = \Omega\left(\frac{n^2}{P^{2/\omega_0}}\right).$$

Proof. Identical to Theorem 2.1 in [9], with $\omega_0$ replacing
$\log_2 7$. □

Programs Referencing Arrays.

The model defined in Equation (1) encompasses most direct
linear algebra computations, but lower bounds can be
obtained for a more general set of computations. In particular,
Christ et al. [21] consider programs of the following
form:

$$\text{for all } \mathcal{I} \in \mathcal{Z} \subseteq \mathbb{Z}^d, \text{ in some order,} \quad \mathrm{inner\_loop}(\mathcal{I}, (A_1, \ldots, A_m), (\phi_1, \ldots, \phi_m)) \qquad (2)$$

where $\mathbb{Z}^d$ is the d-dimensional space of integers and inner_loop()
represents a computation involving arrays $A_1, \ldots, A_m$ of
dimensions $d_1, \ldots, d_m$ that are referenced by the corresponding
subscripts $\phi_1(\mathcal{I}), \ldots, \phi_m(\mathcal{I})$, where $\phi_j$ are affine maps
$\phi_j : \mathbb{Z}^d \to \mathbb{Z}^{d_j}$ for iteration $\mathcal{I} = (i_1, \ldots, i_d)$. For example,
matrix-matrix multiplication has $(A_1, A_2, A_3) = (A, B, C)$,
$\phi_1(\mathcal{I}) = \phi_1(i_1, i_2, i_3) = (i_1, i_3)$,
$\phi_2(\mathcal{I}) = \phi_2(i_1, i_2, i_3) = (i_3, i_2)$,
$\phi_3(\mathcal{I}) = \phi_3(i_1, i_2, i_3) = (i_1, i_2)$, and the function inner_loop() is
defined as $A_3(\phi_3(\mathcal{I})) = A_3(\phi_3(\mathcal{I})) + A_1(\phi_1(\mathcal{I})) * A_2(\phi_2(\mathcal{I}))$.

Because the work inside the loop is currently defined as
a general function, the space of potential executions of
inner_loop() must be restricted in a manageable manner, that is,
to "legal parallel executions" as defined in [21]. To express
the lower bounds, we define a set of linear constraints on a
vector of unknown scalars $(s_1, \ldots, s_m)$:

$$\mathrm{rank}(H) \le \sum_{j=1}^{m} s_j \, \mathrm{rank}(\phi_j(H)), \qquad (3)$$

for all subgroups H of $\mathbb{Z}^d$, where rank(H) is the cardinality
of any maximal subset of the Abelian group H that is linearly
independent (see Footnote 3). For such computations we have the following
lower bound:

Theorem 3.5 ([21]). Consider an algorithm performing
a computation of the form given by equation (2) on P
processors, each with local memory of size M, and assume
the input data is initially evenly distributed across processors.
Then for any legal parallel execution and sufficiently
large $|\mathcal{Z}|/P$, the number of words some processor must
communicate is at least

$$W_{proc}(P, M, N) = \Omega\left(\frac{|\mathcal{Z}|}{P \cdot M^{s_{HBL} - 1}}\right)$$

where $s_{HBL}$ is the minimum value of $\sum_{j=1}^{m} s_j$ subject to (3),
assuming that this linear program is feasible (see [21]).

We restate the memory-independent bound from [21] for
such computations (note that the formal proof has not yet
appeared). For legal parallel executions of computations of
the form (2) on P processors, some processor must move

$$W_{proc}(P, N) = \Omega\left(\left(\frac{|\mathcal{Z}|}{P}\right)^{1/s_{HBL}} - \frac{N}{P}\right) \qquad (4)$$

words, where N is the sum of the sizes of the arrays $\{A_i\}$ (assumed
to be evenly distributed across processors) and $s_{HBL}$
is defined as in Theorem 3.5. In most cases, the negative
term in the expression is asymptotically dominated and can
be ignored.

Footnote 3: The rank of an Abelian group is analogous to the concept
of the dimension of a vector space.

Note that Theorem 3.5 generalizes Theorem 3.1. For example,
matrix multiplication satisfies both forms (1) and
(2), where in the latter case $|\mathcal{Z}| = n^3$ and $s_{HBL} = 3/2$.
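
As an illustration of how matrix multiplication fits the loop form (2), the sketch below runs the computation in Python using the three subscript maps given in the text for $(A_1, A_2, A_3) = (A, B, C)$. The function name and the use of nested lists are illustrative choices, not part of the paper; the iteration counter simply confirms that the iteration space has $n^3$ points.

```python
def matmul_via_affine_maps(A, B):
    """Multiply square matrices by iterating over I = (i1, i2, i3) and
    addressing each array only through its subscript map phi_j(I)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]

    # Subscript maps for (A1, A2, A3) = (A, B, C), as in the text.
    phi1 = lambda i1, i2, i3: (i1, i3)
    phi2 = lambda i1, i2, i3: (i3, i2)
    phi3 = lambda i1, i2, i3: (i1, i2)

    iterations = 0
    for i1 in range(n):
        for i2 in range(n):
            for i3 in range(n):
                r1, c1 = phi1(i1, i2, i3)
                r2, c2 = phi2(i1, i2, i3)
                r3, c3 = phi3(i1, i2, i3)
                # inner_loop: A3(phi3(I)) = A3(phi3(I)) + A1(phi1(I)) * A2(phi2(I))
                C[r3][c3] = C[r3][c3] + A[r1][c1] * B[r2][c2]
                iterations += 1
    return C, iterations


product, num_iterations = matmul_via_affine_maps([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The loop order here is one arbitrary choice; form (2) allows the iterations to be visited "in some order", which is what leaves room for different parallel schedules.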

Theorem 3.5 also applies to, for example, N-body computations
where all pairs of interactions are computed. In
this case, $|\mathcal{Z}| = \Theta(N^2)$ and $s_{HBL} = 2$, yielding lower bounds
of $W_{proc}(P, M, N) = \Omega(N^2/(PM))$ and
$W_{proc}(P, N) = \Omega(N/P^{1/2})$. We also note that Theorem 3.5 applies to
N-body computations that use a distance cutoff to reduce the
number of neighbor interactions, i.e., $|\mathcal{Z}| \ll N^2$.

FFT/Sorting. 

We are unaware of any memory-dependent per-processor
lower bound for the FFT, although a sequential lower
bound was proven by Hong and Kung [25]. A parallel
memory-independent per-processor bound has been proven in the
LPRAM [4] and the BSP [15] models of computation. The
LPRAM model lower bound implies asymptotically the same
lower bound for our distributed parallel model:

Theorem 3.6 ([4]). Given an algorithm that computes
an n-input FFT digraph on an LPRAM model of computation
with P processors, where no recomputation is allowed,
the I/O complexity of the algorithm is

$$W_{proc}(P, N) = \Omega\left(\frac{n \log n}{P \log(n/P)}\right).$$

3.2  Small  Set  Expansion  of  Various  Networks 

We next demonstrate our bounds on several classes of algorithms
on a particular pair of networks: D-dimensional
tori and meshes.

Toroidal networks are common topologies among supercomputers,
with IBM's Blue Gene/L [2] and Blue Gene/P
[1] machines possessing 3D tori. In Blue Gene/Q, IBM used
a 5-dimensional torus [20], and the K computer in Japan utilizes
a 6-dimensional network topology [6]. Intel Xeon Phi
coprocessors rely on a ring-based (1-dimensional torus)
on-chip communication network between cores [27]. Below,
we derive a tight bound on the small set
expansion for this class of networks.

The D-dimensional torus or mesh graph $G_{Net}$ has degree
at most d = O(D) and the small set expansion shown below.
We treat D here as a constant. For a fixed dimension D the
bounds are tight, up to a constant factor. For a tighter
analysis of these graphs, see [17].

Lemma 3.7. Let G be a D-dimensional torus or mesh,
with $k^D$ vertices. Then asymptotically in s,

$$h_s(G) = \Theta\left(s^{-1/D}\right).$$

Proof. For an upper bound on $h_s(G)$ consider a subset
$S \subseteq V(G)$ which is a D-dimensional submesh of length $s^{1/D}$
in each dimension. The number of neighbors of this submesh
on each of its 2D faces is $\Theta(s^{(D-1)/D})$. Thus
$|E(S, V \setminus S)| = 2D \cdot \Theta(s^{(D-1)/D})$. The number of vertices of S is s.
The degree of each vertex is O(D). Hence
$h_s(G) \le 2D \cdot \Theta(s^{(D-1)/D}) / (\Theta(D) \cdot s) = \Theta(s^{-1/D})$.

For a lower bound on $h_s(G)$ we use the Loomis-Whitney
inequality [30]. Consider a set $S \subseteq V(G)$ of size $s \le |V(G)|/2$.
Let $A_1, A_2, \ldots, A_D$ be the projections of S onto the (D - 1)-dimensional
coordinate hyperplanes; let $a_1, \ldots, a_D$ be their
corresponding sizes. Then by the Loomis-Whitney inequality
we have $s^{D-1} \le \prod_{1 \le i \le D} a_i$. Letting $m = \arg\max_i \{a_i\}$,
we have $s^{1-1/D} \le a_m$. Consider the "pencil" of vertices that
corresponds to a point in $A_m$: if there exists a vertex in
the pencil that is not in S, then the pencil contributes at
least one edge to the cut $E(S, V \setminus S)$. We say such a pencil
is partially full. We later show that there are at least
$(1 - 1/2^{1/D}) a_m$ partially-full pencils. Thus they contribute
a total of at least $(1 - 1/2^{1/D}) a_m \ge (1 - 1/2^{1/D}) s^{1-1/D}$
edges to the cut. Hence
$h_s(G) \ge (1 - 1/2^{1/D}) / (2D \cdot s^{1/D}) = \Omega(s^{-1/D})$.
To see that the number of partially-full pencils is
indeed at least $(1 - 1/2^{1/D}) a_m$, assume for the sake of
contradiction that more than $a_m / 2^{1/D}$ pencils are full (i.e., have
all their vertices in S). This implies that
$s > k \cdot a_m / 2^{1/D} \ge k \cdot s^{1-1/D} / 2^{1/D}$, thus $s > k^D/2 = |V|/2$,
which is a contradiction since $s \le |V|/2$. □
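
The upper-bound construction in the proof can be checked numerically. The following Python sketch (function name and parameters are illustrative, not from the paper) counts, for a sub-cube S of side a inside a D-dimensional torus with k nodes per dimension, the edges leaving S and the edges touching S, and returns their ratio exactly. It assumes $k \ge 3$ and $a < k$ so that the two neighbors along each axis are distinct and the sub-cube does not wrap around.

```python
from fractions import Fraction
from itertools import product


def submesh_cut_ratio(k, D, a):
    """Return |E(S, V \\ S)| / |E(S)| for an a^D sub-cube S of a k^D torus.

    Assumes k >= 3 and a < k.
    """
    inside = set(product(range(a), repeat=D))
    cut = 0            # edges with exactly one endpoint in S
    internal_ends = 0  # each internal edge is seen once from each endpoint
    for v in inside:
        for axis in range(D):
            for step in (1, -1):
                w = list(v)
                w[axis] = (w[axis] + step) % k
                if tuple(w) in inside:
                    internal_ends += 1
                else:
                    cut += 1
    touching = internal_ends // 2 + cut  # |E(S)|
    return Fraction(cut, touching)


ratio_2d = submesh_cut_ratio(k=16, D=2, a=4)  # s = 16 vertices
ratio_3d = submesh_cut_ratio(k=8, D=3, a=2)   # s = 8 vertices
```

For such a sub-cube the ratio works out to 2/(a + 1) with $a = s^{1/D}$, so the ratio multiplied by $s^{1/D}$ stays between 1 and 2, in line with the $\Theta(s^{-1/D})$ scaling of the lemma.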


4.  APPLICATIONS 


4.1  Deriving  the  Contention  Lower  Bounds 

In this section, we derive contention lower bounds by plugging
the memory-dependent and memory-independent
per-processor lower bounds [26, 12, 9, 21] into Theorem 2.3 and
using the properties of D-dimensional tori. Table 1 summarizes
these results. In the algebra that follows, we assume
the network topology to be a D-dimensional torus or mesh.

Direct Linear Algebra, Strassen, Strassen-like, and $\Theta(n^2)$
n-body algorithms.

We apply Theorem 2.3 to the relevant per-processor bounds
given in Section 3.1. Let F denote the number of work
operations (e.g., flops or loop iterations) of the different
computations. The per-processor memory-dependent bound is
thus:

$$W_{proc}(P, M, N) = \Omega\left(\frac{F}{P \cdot M^{\alpha-1}}\right) \qquad (5)$$

where $\alpha = 3/2$ for direct dense linear algebra, $\alpha = \omega_0/2$
for Strassen-like matrix multiplication, and $\alpha = 2$ for the $\Theta(n^2)$
n-body problem. We next apply Theorem 2.3 to (5). By
Lemma 3.7, for a D-dimensional torus, the denominators
of the contention bounds in Theorem 2.3 and Expression
(??) are $2D \cdot t \cdot \Theta(t^{-1/D})$. Thus, the memory-dependent
contention bound is:

$$W_{link}(P, M, N) = \max_{t \in T} \Omega\left(\frac{F}{P \cdot M^{\alpha-1}} \cdot t^{1-\alpha+1/D}\right) \qquad (6)$$

Note that $t^{1-\alpha+1/D}$ is monotonic (in the given range), but
that the exponent can be positive, negative or zero. If the
exponent of t is negative or zero, then the expression is maximized
at t = 1, reproducing the per-processor bound (up
to a constant factor). If the exponent is positive, namely
$D < D_1 = 1/(\alpha - 1)$, then the expression is maximized at

t = P/2, and we obtain a new and tighter bound (see Footnote 4):

$$W_{link}(P, M, N) = \Omega\left(\frac{F}{P^{\alpha-1/D} \cdot M^{\alpha-1}}\right). \qquad (7)$$

The per-processor memory-independent bound is

$$W_{proc}(P, N) = \Omega\left(\left(\frac{F}{P}\right)^{1/\alpha}\right). \qquad (8)$$

We next apply Theorem 2.3 to (8) and obtain:

$$W_{link}(P, N) = \max_{t \in T} \Omega\left(\left(\frac{F}{P}\right)^{1/\alpha} \cdot t^{1/\alpha-1+1/D}\right). \qquad (9)$$

Again, $t^{1/\alpha-1+1/D}$ is monotonic and its exponent may be positive,
negative or zero. If the exponent of t is negative or zero, then
the expression is maximized at t = 1, reproducing the
per-processor bound (up to a constant factor). If the exponent
is positive, namely $D < D_2 = \alpha/(\alpha - 1)$, then the expression
is maximized at t = P/2, and we obtain a new and tighter
bound:

$$W_{link}(P, N) = \Omega\left(\frac{F^{1/\alpha}}{P^{1-1/D}}\right). \qquad (10)$$

Table 1 presents the communication lower bounds for each
of the computations described in Section 3.1 on D-dimensional
tori with the respective values of F and $\alpha$.

Programs that Reference Arrays.

Note that if we assume that $F = \Theta(N^{\alpha})$ in the
memory-independent lower bound for programs that reference arrays
with $\alpha = s_{HBL}$, we arrive at the form of this bound
used for the derivation of the direct linear algebra, Strassen,
Strassen-like and $\Theta(n^2)$ n-body contention bounds. In general,
this does not have to be the case for the set of programs
defined by Expression (2) above.

According to Theorem 3.5, the memory-dependent
per-processor bandwidth lower bound for programs defined by
Expression (2) is

$$W_{proc}(P, M, N) = \Omega\left(\frac{|\mathcal{Z}|}{P \cdot M^{s_{HBL}-1}}\right).$$

Similar to the derivation for the previous problems (albeit
with $\alpha = s_{HBL}$), the bound becomes

$$W_{link}(P, M, N) = \max_{1 \le t \le P/2} \Omega\left(\frac{|\mathcal{Z}|}{P \cdot M^{s_{HBL}-1}} \cdot t^{1-s_{HBL}+1/D}\right)$$

which is maximized at either t = 1 (the per-processor bound)
or t = P/2 (see Footnote 4). So, we obtain

$$W_{link}(P, M, N) = \Omega\left(\frac{|\mathcal{Z}|}{P^{s_{HBL}-1/D} \cdot M^{s_{HBL}-1}}\right)$$

as a memory-dependent lower bound on contention. In a
similar manner, we can derive a memory-independent contention
lower bound. From Equation (4), the memory-independent
per-processor bound is then

$$W_{proc}(P, N) = \Omega\left(\left(\frac{|\mathcal{Z}|}{P}\right)^{1/s_{HBL}}\right)$$

assuming we drop the N/P term from the bound. At t =
P/2 (as again we observe that the contention bound is maximized
at either t = 1 or t = P/2), we derive the
memory-independent lower bound on contention

$$W_{link}(P, N) = \Omega\left(\frac{|\mathcal{Z}|^{1/s_{HBL}}}{P^{1-1/D}}\right).$$

FFT/Sorting.

As with the previous algorithms, we apply Theorem 2.3
to the relevant per-processor bound given in Section 3.1.
The per-processor memory-independent bound is thus

$$W_{proc}(P, N) = \Omega\left(\frac{n \log n}{P \log(n/P)}\right). \qquad (11)$$

We next apply this bound to Theorem 2.3 and obtain:

$$W_{link}(P, N) = \max_{t \in T} \Omega\left(\frac{n \log n \cdot t^{1/D}}{P \log(n t/P)}\right). \qquad (12)$$

Again, when t = 1 we obtain the original per-processor
bound. Equation (12) has a stationary point at $t = P C^D / n$
(where C is the base of the logarithm), but via consideration
of the second derivative with respect to t, it can be shown that this
point is a minimum for all relevant values of n, P and D. Thus,
we can derive a memory-independent contention bound by
setting t = P/2 (see Footnote 4):

$$W_{link}(P, N) = \Omega\left(\frac{N}{P^{1-1/D}}\right) \qquad (13)$$

as $N = \Theta(n)$.

Footnote 4: Note that there may not be a subset of the vertices of $G_{Net}$
of size exactly P/2 that attains the small set expansion $h_t(G_{Net})$.
However, the small set expansion of tori and meshes is
attained for small sets of size P/c for some constant $c \ge 2$
(e.g., consider a sub-torus), hence the following contention
analysis holds up to a constant factor.

4.2 Analysis and Interpretation

Which bound dominates?

Our first observation is that, for these computations, the
memory-independent contention bound dominates the
memory-dependent contention bound for many algorithms. In the
cases of direct linear algebra, Strassen and Strassen-like, and
the $\Theta(n^2)$ n-body problem we prove this by contradiction:
if the memory-dependent contention bound dominates, then
the problem is too large to be distributed across all the
processors' local memories. Thus, if

$$\frac{F}{P^{\alpha-1/D} \cdot M^{\alpha-1}} > \frac{N}{P^{1-1/D}},$$

then, as $F = \Theta(N^{\alpha})$, we have

$$N^{\alpha-1} > P^{\alpha-1} M^{\alpha-1}$$

which is a contradiction, as we assumed that $N \le PM$. For
programs that reference arrays, the proof requires a bit more
of the theoretical apparatus from [21] and is given in Appendix
B. We note that in practice the value of constants
may result in the memory-dependent contention bound being
dominant, despite the asymptotic result.

Table 1: Per-processor bounds ($W_{proc}$) ([26, 11, 9, 12, 15]) vs. the new contention bounds ($W_{link}$) on a
D-dimensional torus for classical linear algebra, fast matrix multiplication, $\Theta(n^2)$ n-body, Fast Fourier Transform
(FFT) and a general set of programs that reference arrays.

For direct linear algebra, Strassen, Strassen-like and $\Theta(n^2)$
n-body algorithms, Figure 2 illustrates the relationships between
the four types of bounds for a fixed computation, fixed
problem size N, and fixed local memory size M, varying the
number of processors P and the torus dimension D. See Appendix
A for the derivation of the expressions used in Figure 2.
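
The comparison between the four bounds can also be explored numerically. The Python sketch below is an illustration only: it evaluates the asymptotic expressions of Equations (5), (8), (7) and (10) with all constant factors dropped, and reports which one is largest. The function names and the sample parameters (classical matrix multiplication with n = 4096 and M = 2^16 words per processor) are assumptions chosen for the example.

```python
def lower_bounds(F, N, M, P, D, alpha):
    """Return the four lower-bound expressions, constants dropped."""
    return {
        "proc_mem_dep": F / (P * M ** (alpha - 1)),                       # Eq. (5)
        "proc_mem_indep": (F / P) ** (1 / alpha),                         # Eq. (8)
        "link_mem_dep": F / (P ** (alpha - 1 / D) * M ** (alpha - 1)),    # Eq. (7)
        "link_mem_indep": F ** (1 / alpha) / P ** (1 - 1 / D),            # Eq. (10)
    }


def dominant_bound(F, N, M, P, D, alpha):
    """Name of the largest (tightest) of the four lower bounds."""
    values = lower_bounds(F, N, M, P, D, alpha)
    return max(values, key=values.get)


# Classical matrix multiplication: F = n^3, N = n^2, alpha = 3/2.
n = 4096
params = dict(F=n ** 3, N=n ** 2, M=2 ** 16, P=1024, alpha=1.5)
on_2d_torus = dominant_bound(D=2, **params)
on_3d_torus = dominant_bound(D=3, **params)
```

With these sample values the memory-independent contention bound is the largest on a 2D torus, while on a 3D torus the memory-dependent per-processor bound is; other parameter choices move the crossover, which is the behaviour the figure is meant to convey.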

Depending on the dimension of the torus and number of
processors, the tightest bound may be one of the previously
known per-processor bounds or the memory-independent contention
bound. We first consider subdividing the vertical
axis of Figure 2, which corresponds to the torus dimension
D. Intuitively speaking, the smaller D is, the more likely
contention will dominate communication costs. For a given
algorithm, $\lfloor D_1 \rfloor = \lfloor 1/(\alpha - 1) \rfloor$ is the maximum
torus dimension such that the communication cost is
dominated by contention for all input and machine parameters.
Similarly, $\lceil D_2 \rceil = \lceil \alpha/(\alpha - 1) \rceil$ is the minimum
torus dimension such that the communication cost is
not dominated by contention (at least not by the bound
proved here). Note that for a combination of an algorithm
and a D-dimensional torus such that $D_1 < D < D_2$, either
the per-processor memory-dependent bound or the
memory-independent contention bound may dominate. See Table 2
for values of $D_1$ and $D_2$ for various matrix multiplication
algorithms. In particular, note that for the classical algorithm,
a 2D torus is not sufficient to avoid contention. While
Cannon's algorithm [18] does not suffer from contention on
a 2D torus network, it is also not communication-optimal.
The more communication-efficient "3D" algorithms [14, 4,
31, 36], which utilize extra memory and have the ability
to strong scale perfectly, require a 3D torus to attain the
per-processor lower bounds. For matrix multiplication algorithms
with smaller exponents, the torus dimension requirements
for remaining contention-free are even larger.
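
The two thresholds can be computed directly from the matrix multiplication exponent. The short Python sketch below (function name is illustrative) uses $\alpha = \omega_0/2$, $D_1 = 1/(\alpha - 1)$ and $D_2 = \alpha/(\alpha - 1)$ as defined in the text; the exponents passed in are the approximate values listed in Table 2.

```python
import math


def torus_thresholds(omega0):
    """Return (floor(D1), ceil(D2)) for a matmul algorithm with exponent omega0."""
    alpha = omega0 / 2
    d1 = 1 / (alpha - 1)      # below this, contention always dominates
    d2 = alpha / (alpha - 1)  # at or above this, contention never dominates
    return math.floor(d1), math.ceil(d2)


thresholds = {
    "Classical": torus_thresholds(3),
    "Strassen [38]": torus_thresholds(2.81),
    "Schonhage [33]": torus_thresholds(2.55),
    "Strassen [39]": torus_thresholds(2.48),
    "Vassilevska [40]": torus_thresholds(2.3727),
}
```

Note that $D_2 = D_1 + 1$ always holds, since $\alpha/(\alpha - 1) = 1 + 1/(\alpha - 1)$, so the intermediate regime covers at most one integer torus dimension when $D_1$ is not an integer.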

Range  of  perfect  strong  scaling. 

We next consider subdividing the horizontal axis of Figure 2,
which corresponds to the number of processors P.
Because Figure 2 shows a fixed problem size, increasing P
(moving to the right) corresponds to "strong scaling." We
differentiate between whether or not the computation has
the possibility of strong scaling perfectly: that is, for a fixed
problem size, increasing the number of processors by a constant
factor reduces the communication costs (and running
time) by the same constant factor. Note that of the bounds,
the memory-dependent per-processor bound (Equation (5))
exhibits this possibility of perfect strong scaling, as P appears
in the denominator with an exponent of 1. However, as
P increases, one of the memory-independent bounds eventually
dominates and perfect strong scaling is no longer possible.
See [9] for a discussion of this behavior given only
per-processor bounds.

For direct linear algebra, Strassen-like methods and the
$O(n^2)$ n-body problem, when $D \ge D_2$ and
$P < (F/(N M^{\alpha-1}))^{\alpha/(\alpha-1)}$, the
memory-dependent per-processor bound dominates. When this
happens, we have a perfect strong scaling range. For values
of P beyond this range, the communication cost is dominated
by the memory-independent per-processor bound (see [9] for
further discussion). When $D_1 < D < D_2$, a smaller
strong-scaling range exists for $P < (F/(N M^{\alpha-1}))^{D}$;
for values of P beyond this range, the communication cost
bound is dominated by contention. If $D \le D_1$, then the
contention bounds always dominate and there is no
strong-scaling range. A similar analysis can demonstrate such
a region of perfect strong scaling in runtime for programs
that reference arrays.
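The three regimes can be sketched in a few lines of code. The threshold formulas used here, D_1 = 1/(alpha - 1) and D_2 = alpha/(alpha - 1) with alpha equal to half the matrix multiplication exponent, are an assumption of this sketch: they are the values at which the two range limits above stop exceeding the minimum number of processors and stop differing from each other, respectively. The function names are hypothetical.

```python
import math


def torus_thresholds(omega):
    """Return (D1, D2) for a matrix multiplication exponent omega.

    Assumption of this sketch: with alpha = omega / 2,
    D1 = 1 / (alpha - 1) and D2 = alpha / (alpha - 1).
    """
    alpha = omega / 2
    return 1 / (alpha - 1), alpha / (alpha - 1)


def regime(D, omega):
    """Classify a D-dimensional torus for an algorithm with exponent omega."""
    d1, d2 = torus_thresholds(omega)
    if D >= d2:
        return "never contention bound"
    if D <= d1:
        return "always contention bound"
    return "contention bound beyond a reduced strong-scaling range"


exponents = {
    "Classical": 3.0,
    "Strassen [38]": math.log2(7),
    "Schonhage [33]": 2.55,
    "Strassen [39]": 2.48,
    "Vassilevska [40]": 2.3727,
}

for name, omega in exponents.items():
    d1, d2 = torus_thresholds(omega)
    print(f"{name:18s} floor(D1)={math.floor(d1)}  ceil(D2)={math.ceil(d2)}")
```

For example, `regime(3, math.log2(7))` places Strassen's algorithm on a 3D torus in the intermediate regime, while `regime(2, math.log2(7))` reports that a 2D torus is always contention bound.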

Figure 2: Relationship between the per-processor and contention
communication lower bounds for direct linear algebra,
Strassen/Strassen-like and the $O(n^2)$ n-body problems.
Horizontal axis: processors (P).

Algorithm           Exponent    $\lfloor D_1 \rfloor$    $\lceil D_2 \rceil$
Classical           3                                    3
Strassen [38]       ≈ 2.81      2                        4
Schönhage [33]      ≈ 2.55      3                        5
Strassen [39]       ≈ 2.48      4                        6
Vassilevska [40]    ≈ 2.3727    5                        7

Table 2: Torus dimensions for which the communication cost is
either always contention bound ($D \le \lfloor D_1 \rfloor$) or
never contention bound ($D \ge \lceil D_2 \rceil$) for a
selection of matrix multiplication algorithms. The assertions
regarding the last three algorithms hold under some technical
assumptions or conjectures; see [12].

Figure 3 shows this behavior for Strassen’s matrix
multiplication (where $\alpha = (\log_2 7)/2$) given the
relevant torus dimensions. For Strassen,
$F/(N M^{\alpha-1}) = (N/M)^{\alpha-1} = P_{\min}^{\alpha-1}$,
where $P_{\min}$ is the minimum number of processors required
to store the problem, as $F = \Theta(N^{\alpha})$. Note that
the lower subfigure in Figure 3 is on a log-log scale, while
the upper subfigure’s y-axis is linear. For a good enough
network ($D \ge 4$), the perfect strong scaling range is
$P_{\min} \le P \le P_{\min}^{(\log_2 7)/2} \approx P_{\min}^{1.40}$.
For a 3D torus, the perfect strong scaling range shrinks to
$P_{\min} \le P \le P_{\min}^{3(\log_2 7 - 2)/2} \approx P_{\min}^{1.21}$.
On a 2D torus, perfect strong scaling is impossible. These
three regions of network dimension ($D \ge D_2$, $D \le D_1$
and $D_1 < D < D_2$) are delimited in Figure 2 by the points
of transition between dominance of the various bounds. The
upper portion of Figure 3 demonstrates the regions of
dominance for the various network dimensions in the case of
Strassen’s algorithm.
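The exponents quoted for Strassen's algorithm can be checked numerically. The sketch below assumes F = N^alpha, so that both range limits become powers of the minimum processor count: alpha for the per-processor limit and (alpha - 1) * D for the contention limit. Taking the smaller of the two is equivalent to the case analysis on D. The function name is hypothetical.

```python
import math

ALPHA = math.log2(7) / 2   # alpha for Strassen's algorithm


def scaling_range_exponent(D, alpha=ALPHA):
    """Exponent e such that the range is P_min <= P <= P_min**e.

    Returns None when no perfect strong-scaling range exists.
    Assumes F = N**alpha, so F / (N * M**(alpha - 1)) = P_min**(alpha - 1).
    """
    per_processor_limit = alpha          # (alpha - 1) * alpha / (alpha - 1)
    contention_limit = (alpha - 1) * D   # exponent of the contention crossover
    e = min(per_processor_limit, contention_limit)
    return e if e > 1 else None


for D in (2, 3, 4, 5):
    e = scaling_range_exponent(D)
    if e is None:
        print(f"D={D}: no perfect strong-scaling range")
    else:
        print(f"D={D}: P_min <= P <= P_min^{e:.2f}")
```

Under these assumptions the 2D torus has no range, the 3D torus gives an exponent of about 1.21, and any torus of dimension four or more gives about 1.40.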

5.  FUTURE  RESEARCH 

Other  Networks. 

In this work, we exclusively address link contention bounds
for tori and mesh networks. We suspect that results for
hypercubes and certain indirect networks (e.g., fat trees)
should follow easily. For indirect networks, a method for
integrating router nodes into the model of computation needs
to be defined. Indirect topologies are common in datacenters
as well as on-chip networks, so such an extension of the
contention bounds for direct networks would be useful.

Figure 3: Communication bounds for Strassen’s algorithm on
D-dimensional tori. The lower plot is log-log, while the upper
is linear on the y-axis. Horizontal lines in the lower plot
correspond to perfect strong scaling.


Applicability. 

A network may have expansion sufficiently large to preclude
the use of our contention bound on a given computation, yet
the contention may still dominate the communication cost.
This calls for further study on how well computations and
networks match each other. Similar questions have been
addressed by Leiserson and others [13, 24, 29], and have had
a large impact on the design of supercomputer networks. In
particular, a parallel computer that uses a fat tree
communication network can simulate any other routing network,
at the cost of at most polylogarithmic slowdown.

Communication  Efficient  Algorithms. 

Some parallel algorithms are network aware, and attain the
per-processor communication lower bounds when network graphs
allow it (cf. [36] for classical matrix multiplication on a
3D torus). Many algorithms are communication optimal when
all-to-all connectivity is assumed, but their performance on
other topologies has not yet been studied.

Are there algorithms that attain the communication lower
bounds for any realistic network graph (either by auto-tuning,
or by network-topology-oblivious tools)?
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APPENDIX 

A.  DERIVATION  OF  FIGURE  EXPRESSIONS 

•  Equivalence  point  for  per-processor  bounds 

We set the per-processor bounds equal to each other, and
solve for P:

$$\frac{F}{P M^{\alpha-1}} = \Theta\left(\frac{N}{P^{1/\alpha}}\right) \implies P = \Theta\left(\left(\frac{F}{N M^{\alpha-1}}\right)^{\alpha/(\alpha-1)}\right)$$


•  Equivalence  point  for  contention  bounds 

We  set  the  contention  bounds  equal  to  each  other,  and 
solve  for  P: 

_ £ _ =  e(^— 
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•  Equivalence  point  for  the  memory-dependent  per- 
processor  and  memory-independent  contention 
bounds 

We  set  the  memory-dependent  per-processor  and  memory- 
independent  contention  bounds  equal  to  each  other, 
and  solve  for  P  as  a  function  of  D: 


$$\frac{F}{P M^{\alpha-1}} = \Theta\left(\frac{N}{P^{1-1/D}}\right) \implies P = \Theta\left(\left(\frac{F}{N M^{\alpha-1}}\right)^{D}\right)$$
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B.  DOMINANCE  OF  MEMORY-INDEPENDENT 
CONTENTION  BOUND 


Claim B.1. Let Alg be an algorithm performing a computation
of the form given by equation (2) on P processors, each with
local memory of size M, and assume the input data is initially
evenly distributed across processors. Then,

$$\frac{|Z|^{1/s_{HBL}}}{M} \le \sum_{j=1}^{m} \frac{|\phi_j(Z)|}{M}.$$

As the minimum number of processors required to hold the
problem is the right-hand side of this inequality, we conclude
that the memory-independent contention bound dominates the
memory-dependent contention bound, as the two bounds are
equivalent when $P = |Z|^{1/s_{HBL}}/M$.

Proof. The HBL bound discussed in Christ et al. [21] states
(under certain assumptions) that

$$|Z| \le \prod_{j=1}^{m} |\phi_j(Z)|^{s_j}.$$

To detail an argument from Section 2 of [21], we present
several successively weaker upper bounds on $|Z|$ that will
allow us to demonstrate the desired result:

$$|Z| \le \prod_{j=1}^{m} |\phi_j(Z)|^{s_j} \le \prod_{j=1}^{m} \Bigl(\max_{i} |\phi_i(Z)|\Bigr)^{s_j} = \Bigl(\max_{i} |\phi_i(Z)|\Bigr)^{s_{HBL}},$$

where $s_{HBL} = \sum_{j=1}^{m} s_j$.
As $\max_{j} x_j \le \sum_{i=1}^{m} x_i$ if $x_i \ge 0$,

$$|Z| \le \Bigl(\sum_{i=1}^{m} |\phi_i(Z)|\Bigr)^{s_{HBL}},$$

which proves the desired inequality if we take the
$s_{HBL}$-th root of both sides and divide by M. □
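As a sanity check of the inequality in Claim B.1, the sketch below evaluates both sides for classical n-by-n matrix multiplication. The instance is an assumption chosen for illustration: Z is taken to be the n^3-point iteration space, the three projections are the n^2-entry matrices A, B and C, and s_HBL = 3/2. The function name is hypothetical.

```python
def claim_sides(n, M):
    """Both sides of the Claim B.1 inequality for n-by-n matrix multiplication.

    Assumptions for this illustration: Z is the n**3 iteration space,
    the projections phi_j(Z) are the n**2-entry matrices A, B and C,
    and s_HBL = 3/2.
    """
    s_hbl = 1.5
    z = n ** 3
    projections = [n ** 2, n ** 2, n ** 2]
    lhs = z ** (1 / s_hbl) / M       # |Z|^(1/s_HBL) / M
    rhs = sum(projections) / M       # processors needed to hold all arrays
    return lhs, rhs


for n in (64, 1024, 4096):
    lhs, rhs = claim_sides(n, M=2 ** 16)
    print(f"n={n:5d}  lhs={lhs:10.3f}  rhs={rhs:10.3f}  holds: {lhs <= rhs}")
```

In this instance the left-hand side is n^2/M and the right-hand side is 3n^2/M, so the two sides differ only by a constant factor.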


